Overview

This tutorial covers:

  • Downloading flow data from USGS gauges

  • Basic plotting of hydrograph data

  • “Wrangling” the data into a more flexible format

  • Calculating basic hydrograph metrics

    • Flow Duration Curves

    • Annual Flood Frequency

    • Empirical return intervals

Setup

You will need to download the dataRetrieval and tidyverse packages:
* You only need to do this once per machine
* Copy the following code into your console and run it (press enter)
* Do NOT put this code into your script

install.packages("dataRetrieval")
install.packages("tidyverse")

Start a new script

Now that we have the packages installed, we can start a new script (white square with a green plus sign icon in the top left) and begin our analysis.

Load Libraries

  • Only need to install package once
  • Need to “load” it every time
library(dataRetrieval)
library(tidyverse)

Download data

  • We will use the dataRetrieval package to download data

  • For a more complete tutorial on the dataRetrieval package, see this website

  • Here, we will download data from the USGS gauge on the Colorado River near the UT-CO stateline

  • Save variables to make download easier

siteNo <- "USGS-09163500" #CO River near CO-UT stateline
# you will need to look up a site code for a different gauge for your assignment

pCode <- "00060" #discharge

# all daily data from 2003-2023
start.date <- "2003-10-01"
end.date <- "2023-12-31"
# For your assignment, make sure there are at least 10 years of data, and change the dates as appropriate
  • Now use the read_waterdata_daily() function to download the data, and save it as an object called co_r
co_r <- read_waterdata_daily(
  monitoring_location_id = siteNo,
  parameter_code = pCode,
  time = c(start.date, end.date))
## Requesting:
## https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-09163500&parameter_code=00060&time=2003-10-01%2F2023-12-31
  • We can now print out that data to see what it looks like:
# just first few rows:
co_r
# statistic_id == 00003 = daily mean
  • You can also View() the whole thing with this code:
  • Note that the following does not run in this tutorial for visualization purposes, but I will do it in class.
View(co_r)

Plot Basic Hydrograph

  • Make a plot with time on the x-axis and value on the y-axis

    • Here time is actually the date
    • value is CFS (we will change the name of the column later for simplicity)
  • Include informative title

  • add units to y-axis label

ggplot(co_r, 
       aes(x = time, 
           y = value)) +
  geom_line() +
  labs(title= "Streamflow for the Colorado River",
       subtitle = "Gage 09163500 near the CO-UT border",
       y = "Streamflow (CFS)") +
  theme_bw()

* Qualitatively describe this flow pattern
* Consider the:
  * Magnitude
  * Frequency
  * Duration
  * Timing
  * Rate of change

Just plotting one year

  • Your data may have too many years to easily view the flow regime

  • We will create a y column to indicate the year

  • Then we will filter a single year value to plot that

co_r |>
  mutate(y = year(time)) |>
  filter(y == 2015) |> # Make sure you change this to match one year which occurs in your data
  ggplot( 
       aes(x = time, 
           y = value)) +
  geom_line() +
  labs(title= "Streamflow for the Colorado River in 2015", # make sure year in title matches
       subtitle = "Gage 09163500 near the CO-UT border",
       y = "Streamflow (CFS)") +
  theme_bw()

Data cleanup for daily FDC

  • This produces a data frame (table) with 7397 rows and 12 columns
  • Each row is the mean value for a single day
  • For our purposes, we only need the columns named:
    • time : In this case, the date in YYYY-MM-DD format
    • value which is the mean discharge (Q) for that day
      • We will rename it to be q
    • We could also keep the unit_of_measure column, but it’s always CFS, so we will just have to remember that
  • We also want to remove the “geometry attributes”
    • Discussion of this is beyond the scope of this tutorial
# remove "geometry" attributes
attr(co_r, "class") <- "data.frame"

# overwrite co_r to just be the two columns
# and rename value column to be "q"
co_r <- co_r |>
  select(time, value)|>
  rename(q = value)
co_r

Calculate Daily Flow Duration Curves

Recall that a FDC is calculated as:

\[ \text{% Exceedance} = \frac{m}{N+1}*100 \]

where \(m\) is the rank of the flow, and \(N\) is the number of observations (number of days) in the data
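As a quick numeric check on this formula, here is a base-R sketch applying it to five made-up daily flows (the values are hypothetical, not from the gauge):

```r
# Five hypothetical daily flows (CFS)
q <- c(120, 450, 80, 300, 95)

# rank 1 = largest flow
m <- rank(-q)

# percent exceedance: m / (N + 1) * 100
pe <- m / (length(q) + 1) * 100
round(pe, 1)
# the largest flow (450) gets the smallest percent exceedance (~16.7%)
```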

  • Sort the flows so that the highest flow is the first row, and the smallest flow is the last row
daily_pe <- co_r |>
  # sort by the q column
  arrange(-q) |>
  # add a "rank" column
  mutate(m_rank = 1:n()) |>
  # calculate % exceedance
  mutate(PE = (m_rank / (n() +1)) *100)

# print out the new data object
daily_pe
  • plot the Flow Duration Curve
  • x = percent exceedance (PE)
  • y = q
ggplot(daily_pe, 
       aes(x = PE, 
           y = q)) +
  geom_line()

* FDCs are usually plotted with logarithmic y- and x-axes

ggplot(daily_pe, 
       aes(x = PE, 
           y = q)) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10()

* we can also add a title and axis labels for clarity

ggplot(daily_pe, 
       aes(x = PE, 
           y = q)) +
  geom_line() +
  scale_y_log10() +
  scale_x_log10() +
  labs(title = "Daily Flow Duration Curve",
       subtitle = "CO River near CO-UT stateline",
       y = "Q (CFS)",
       x = "Percent Exceedance")

Flows at specific percent exceedances

  • We can use the daily_pe object to see what flow equates to a specific percent exceedance
# 10% exceedance?
daily_pe |>
  filter(PE == 10)
  • this doesn’t return any rows, because by chance there is no row with a PE of exactly 10.0
  • But we can look for the approximate value with this:
daily_pe |>
  filter(PE >= 9.9,
         PE <= 10.1)
  • The 10% exceedance is around 11300-11500 CFS
  • That’s probably too wide of a band, so let’s add another decimal place to our filter command
daily_pe |>
  filter(PE > 9.99,
         PE < 10.01)
  • so, the flow in the Colorado river exceeded or met 11400 CFS only 10% of the time

  • In other words, a flow of 11400 CFS was relatively rare

  • What are common flow values?

daily_pe |>
  filter(PE > 74.99,
         PE < 75.01)
  • a flow of 2870 CFS was exceeded ~75% of the time, so this value is fairly common
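Instead of tightening the filter bounds by hand, base R’s approx() function can linearly interpolate the flow at an exact exceedance value. A minimal sketch on a toy FDC (the pe and q vectors below are made-up; with the real data you would pass daily_pe$PE and daily_pe$q instead):

```r
# Toy flow duration curve (hypothetical values)
pe <- c(5, 25, 50, 75, 95)            # percent exceedance
q  <- c(9000, 4000, 2000, 900, 300)   # flow (CFS)

# linearly interpolate the flow at exactly 10% exceedance
approx(x = pe, y = q, xout = 10)$y
# lands between the 5% (9000 CFS) and 25% (4000 CFS) values
```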

Calculate Annual Maximum Flow Duration Curves (AMS-FDC)

  • For this, we will be going back to our co_r data set
  • Recall that the formula for AMS-FDC is:

\[ R_{i} = \frac{N+1}{m} \]

  • Where \(R_i\) is the recurrence interval (in years)

  • \(m\) is the rank of the annual maximum flow (1 = largest)

  • \(N\) is the number of years in the data set
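To see the formula in action before running it on the real data: with N = 21 years of record, the three largest annual floods get the following return intervals (plain arithmetic; these match the return intervals we will see in the real data later on):

```r
# With N = 21 years, return intervals for the three largest annual floods
N <- 21
m <- 1:3
Ri <- (N + 1) / m
Ri
# a 22-year, an 11-year, and a ~7.3-year flood
```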

Data cleanup for AMS-FDC

  • Our data right now shows the mean flow value for every day across our timeframe

  • for the AMS-FDC, we want the single largest value for every year in the dataset

  • In other words, we need to group the data by year and only keep the maximum value

  • Currently, co_r has a time column which has the date information in it

  • There are many methods for sorting and grouping data which is in a date-format (YYYY-MM-DD)

  • The method we are going to use is to make separate columns for year, month, and day (they will be named y, m, and d)

  • We can do this using handy functions that are a part of the tidyverse package we already loaded

Make y, m, and d columns

  • we will use the mutate() function to add new columns
  • mutate() adds a column but does not change the number of rows
co_r |>
  mutate(y = year(time))
  • What this code is doing is making a new column called y for the year
  • It is doing this by looking at the data in the time column, and extracting the value for year in that column
  • note that this only works because our time column was already in a date format

How do we check the data format?

  • We can check the format in two different ways with the following code:
co_r # note the <date> notation underneath time
# also note the <dbl> notation under q. dbl is short for "double", which means it is a number format which can contain decimals
co_r |>
  pull(time) |>
  class() # the output here says "Date"
## [1] "Date"
  • If you follow this tutorial to complete your assignment, your data should be in the correct format
  • If you try the year() function and it doesn’t work, it most likely means the data is in the wrong format.
    • If you have this problem, come to my office hours or ask me when we have work time in class.

Continue with making separate date columns

  • There is also a month() and day() function which will extract the relevant date information
# overwrite co_r to have the date columns
co_r <- co_r |>
  mutate(y = year(time),
         m = month(time), 
         d = day(time))
# print out our new co_r to see what it looks like  
co_r
  • note that the classes for our new y, m, and d columns are all <dbl> (numeric)
  • This is convenient for our upcoming calculations, but R no longer views them as “dates”
  • we will keep the time column so that we have the date information in case we need it later

Calculate Annual Maximum flows

  • Recall that we want the single largest value for each year
  • To do this, we will first “group” our data by year (y column)
  • This will create a unique group id for each year, and the following commands will all be performed on the years individually
co_r |>
  group_by(y)
  • Note that the data looks the same, but there is now a Groups: y[21] line at the top of the data output

  • this is saying that the data now contains groups based on values in the y column, and there are 21 unique groups

  • Now we summarize() the data to keep the single largest value per group/year

  • The summarize() function reduces the data to have a single row per group

# ams is short for annual maximum series
ams <- co_r |>
  group_by(y) |>
  summarize(high_flow = max(q))

# print out the new object
ams
  • The ams object now has 21 rows, one per group/year

Calculate AMS-FDC

  • To calculate the AMS-FDC, we need to calculate the rank and arrange in descending order again
ams_ri <- ams |>
  # sort by the high_flow column
  arrange(-high_flow) |>
  # add a "rank" column
  mutate(m_rank = 1:n()) |>
  # calculate the return interval, and the probability
  mutate(Ri = (n() + 1) / m_rank,
         pr = 1 / Ri)

# print out the new data object
ams_ri

Flow estimates for x-year floods

  • We can now use this data to see what flow is associated with a specific return interval

Bankfull flow

  • The annual bankfull flow generally has a return interval between 1.1 and 2 years

  • What is the bankfull flow for the Colorado River at this station?

ams_ri |>
  filter(Ri <= 2,
         Ri >= 1.1)
  • This returns a broad range of values: from 5,090-23,500 CFS

  • Let’s say the bankfull flow here has a return interval of 1.5 years.

  • We can see above that we don’t have that exact return interval, but we can filter values close to that:

ams_ri |>
  filter(Ri <= 1.6,
         Ri >= 1.4)
  • So, assuming that the 1.5 return interval is the bankfull flow, we would estimate that to be between 14,200-14,900 CFS.

10-year flood

  • What flow approximately represents the 10-year flood?

  • Recall that it is unlikely there is exactly a return interval of 10, so use bounds to capture the location

ams_ri |>
  filter(Ri <= 15,
         Ri >= 7)
  • So the 11-year flood has a Q of 38900 CFS, and the 7.3-year flood has a flow of 37200.

50-year flood

  • What flow represents the 50-year flood?
ams_ri |>
  filter(Ri <= 100,
         Ri >= 40)
  • This returns no rows, meaning there is no flood in our data with an \(R_i\) between 40 and 100

  • Let’s see what the maximum \(R_i\) value in our data is

ams_ri |>
  filter(Ri == max(Ri))
  • In this data, the biggest return interval is 22 years, and has a flow of 46800 CFS

  • For this data, all we can say about the 50-year flood is that it is greater than 46800

  • Note that there is a method for extrapolating return intervals called the Log-Pearson Type III analysis

  • Unfortunately, this analysis is beyond the scope of this course, but I want you to be aware of its existence

  • Note that the data you find for the assignment may be more extensive than this, so you may be able to report more accurate estimates for floods with higher return intervals

Risk

  • Risk is calculated with the following formula:

\[ \text{Risk} = 1 - (1 - P)^n \]

  • What’s the risk of having a flood with a 0.15 probability in the next 2 years?

  • \(P = 0.15\) and \(n = 2\)

  • We can use R as a fancy calculator and plug these numbers in

1 - (1 - 0.15)^2
## [1] 0.2775
  • So there is a 0.2775 or 27.75% chance of having that flood occur in the next two years

  • What’s the risk of having a flood with a return interval of 50 years in the next 10 years?

  • here \(n = 10\) but we have to calculate \(P\)

\[ P = 1 / T \]

  • Where \(T\) is the return interval in years

  • So, the risk can be calculated as:
1 - (1 - (1/50))^10
## [1] 0.1829272
  • So the probability that a 50-year flood happens in the next 10 years is 0.183, or an 18.3% chance
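The two risk calculations above can be wrapped in a small helper function (flood_risk is a name introduced here for illustration, not a function from any package):

```r
# Hypothetical helper: risk of at least one occurrence in n years,
# given an annual exceedance probability p
flood_risk <- function(p, n) {
  1 - (1 - p)^n
}

flood_risk(0.15, 2)      # 0.2775
flood_risk(1 / 50, 10)   # ~0.183
```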

The end

  • You should now have everything you need to complete the Hydrology assignment